HGTree v2.0: a comprehensive database update for horizontal gene transfer (HGT) events detected by the tree-reconciliation method

Abstract HGTree is a database that provides horizontal gene transfer (HGT) event information on 2472 prokaryote genomes using the tree-reconciliation method. HGTree was constructed in 2015, and a large number of prokaryotic genomes have been additionally published since then. To cope with the rapid rise of prokaryotic genome data, we present HGTree v2.0 (http://hgtree2.snu.ac.kr), a newly updated version of our HGT database with much more extensive data, including a total of 20 536 completely sequenced non-redundant prokaryotic genomes, and more reliable HGT information results curated with various steps. As a result, HGTree v2.0 has a set of expanded data results of 6 361 199 putative horizontally transferred genes integrated with additional functional information such as the KEGG pathway, virulence factors and antimicrobial resistance. Furthermore, various visualization tools in the HGTree v2.0 database website provide intuitive biological insights, allowing the users to investigate their genomes of interest.


INTRODUCTION
Genetic materials are often inherited from parents to their offspring vertically, but for prokaryotes such as bacteria or archaea (that reproduce asexually), the inheritance can be demonstrated horizontally during species interactions, from one adjacent organism to another, for their evolutionary benefit (1). The asexual reproduction of prokaryotes gives nearly identical genetic information to the offspring, making horizontal gene transfer (HGT) an essential process in achieving genetic variations (2). HGT is one of the driving forces responsible for shaping the prokaryotic genomes and surviving natural selection (3). For example, taking up genes responsible for antimicrobial resistance could be critical in facilitating the organism's adaptation to a particular environment. Contrarily, organisms that have taken up unnecessary genes have more chances of excluding themselves from natural selection if the genes are either neutral or detrimental (4). As much as HGT contributes to the evolution of prokaryotic organisms, it can always complicate the interpretation of the lineage or the history of species evolution, leading to an erroneous result regarding their classifications. Thus, it is fundamental in the research of prokaryotic evolution to understand HGT in depth (5).
In view of this, we constructed HGTree (https://hgtree. snu.ac.kr) (6) in 2015. Other conventional databases such as the HGT-DB (7) or DarkHorse HGT Candidate Resource (8) utilize genome signatures (such as GC bias, nucleotide composition and codon usage) or implicit phylogenetic methods (such as comparison of evolutionary distance derived by sequence similarity). Unlike them, HGTree exploits the tree-reconciliation method, an explicit phylogenetic method that uses species trees and corresponding gene trees to determine horizontally transferred genes. The treereconciliation method is well known for its reliability and is a prevailing method for HGT event detection (9,10). and the HGTree database was able to implement the method while overcoming its challenges (6). As a result, the database was built with 660 840 HGT event results of 2472 prokaryotic genomes.
Since 2015, the amount of prokaryotic genome data has increased dramatically, and so has the requirement for a more comprehensive HGT database with more reliable and efficient processes to detect HGT events. After the publication of HGTree, new methods to detect HGT events have been continuously developed. ShadowCaster hybridized the implicit method with the support-vector-machine (SVM) method (11), and tools such as NearHGT (12) or Recen- To construct the database, 20 179 bacterial and 357 archaeal complete genomes were downloaded from the NCBI. The taxonomy of each genome was reclassified using GTDB-tk. The orthology groups were calculated with POrthoMCL and each group was aligned for phylogenetic gene tree construction. Corresponding species trees were made with 16S rRNA sequences of genomes in the orthology group. Each group's gene tree and species tree are used to perform RANGER-DTL, and putative HGT events and related genes are detected. HGT-related genes' Pfam, antimicrobial resistance, virulence factors, COG clusters and KEGG pathway annotations are also uploaded to the database. (B) The overall algorithm of the user query processor. Users can upload their prokaryotic complete genomes to analyze HGT-related genes in them. In the user query processor, reciprocal blast exploits the HGTree v2.0 database results to calculate new orthology groups that include genes of the input genome. Other parts of the workflow follow the procedures of the previous version, except that users can choose the processor mode (the 'Standard Mode' and 'Strict Mode').
tHGT (13) were developed to detect HGT events between closely related taxa. However, these tools are not suitable for accurately detecting the HGT events among various phyla, leaving the tree-reconciliation method as the most reliable detection method as long as accurate inference trees are supported (14).
Therefore, we present a newly updated HGTree database in response to these demands, coping with the growing interest in HGT. The updated HGTree (referred to as HGTree v2.0) includes approximately eight times larger genomes than the previous version, yielding approximately over six times more putative HGT event results. The newly revised HGT events detection procedure ensured increased detection reliability, and the detected events are presented in the HGTree v2.0 website equipped with more user-friendly interfaces, including the modified user query processor, which allows users to navigate their genomes.

Data processing
The genome data used in this research were retrieved from NCBI using two different search titles: 'Bacteria' and 'Archaea'. Further options included 'Assembly', 'Latest Ref-Seq' (15) and 'Complete genome' (https://www.ncbi.nlm. nih.gov/assembly; May 5, 2021) (16). A total of 20 179 bacterial and 357 archaeal genomes were retrieved at the strain level. Along with the nucleotide and amino acid sequence data, GenBank data were also retrieved, including each genome's size, gene function and the number of coding sequences (CDSs). The taxonomy of each genome was further confirmed and retrieved from the GTDB database (https://gtdb.ecogenomic.org/) (17).
To detect orthologous genes, 20 179 bacterial genomes and 357 archaeal genomes were processed separately using PorthoMCL (18). All-versus-all blast search was performed to detect orthologous genes, and the genes were clustered and assigned to orthology groups. We set the alignment coverage to 80%, an E-value cut-off to 10 -6 and the minimum identity score was 98%. To retrieve 16S rRNA, RNAmmer (ver. 1.2) (19) was used. The program uses an HMMER-(ver. 3.1b2) (20) based scanning procedure to detect 16S rRNA of both bacteria and archaea from the input genome data file.
For each detected orthology group and its corresponding 16S rRNA group, CLUSTAL Omega (ver. 1.2.3) (21) was used for the sequence alignment, and FastTree2 (ver. 2.1.9) (22) was used to construct phylogenetic gene and species trees. As a result, an unrooted gene tree (made of orthologous protein sequences) and a species tree (made of 16S rRNA sequences of genomes from the same orthology group) were constructed for each orthology group. Newick Utility (newick utils ver. 1.6) was used to re-root the species trees using the 18S rRNA sequence from Saccharomyces cerevisiae for the outgroup.
Ranger-DTL 2.0 (23) was used to detect horizontally transferred genes among different orthologous group sets of phylogenic trees. The first round of Ranger-DTL 2.0 was done under default parameters (duplication cost, 2; transfer cost, 3; loss cost, 1) for a standard result. Next, we ran a second round with '3' and '4' for the transfer cost for a more rigorous analysis, leaving the duplication and loss cost as default. The two different transfer cost results were then aggregated as one result in the AggregateRanger step. The results of the first and second rounds were named as the 'Standard DB' and the 'Strict DB', respectively. This process will be further elucidated in the Results and Discussion.

User query processor
We also modified the user query processor of the previous version, and the steps are as follows: first, Prokka (ver. 1.11) (32) and Barrnap (ver. 0.9) (33) are used to predict protein sequences and 16S rRNA of the submitted genome. Second, both the protein sequences of the query genome and HGTree v2.0 data are used to generate two BLAST databases to find the best-hit results by performing the reciprocal BLAST search against each other. For the HGTree v2.0 BLAST database to be used, users can choose between the 'Standard DB' and 'Strict DB' for the BLAST search. The default parameters for the reciprocal BLAST are set as 80% for alignment coverage, 80% for identity score and an E-value cut-off of 10 -6 , but users can choose their parameters too. The best-hit results will be assigned to HGTree v2.0 database orthology groups to generate new orthology groups that include sequences of the query genome. Third, gene trees and corresponding species trees are generated through ClustalO and FastTree2. Finally, Ranger-DTL2 is performed to detect putative horizontally transferred genes from the input genome. In the same way as we built the HGTree v2.0 database, if users choose 'Strict Mode' in the second step, Ranger-DTL will run twice, once with default transfer cost and the secondly time with transfer cost of '3' and '4'. Users will be given a final text file stating the ratio of HGT-related gene events and a list of them, separated into two sections: 'Donated Genes' and 'Received Genes'. Virulence factors, antimicrobial resistance, gene names, product names and the genomes of the donors and recipients will also be included in the text file. The workflows of both database construction and the user query processor are illustrated in Figure 1.

RESULTS AND DISCUSSION
In HGTree v2.0, to keep pace with the ever-inceasing amount of new prokaryotic genome data, we acquired information on putative HGT-related genes through a more accurate tree-reconciliation method. Our major focus of the update was to ensure that the HGT results are more reliable and to show their various functional data in more userfriendly ways. The overview of updates on HGTree v2.0 can be seen in Figure 2.

Enhanced HGT events detection procedures
The orthology grouping step of the previous version was done using Mestortho. Mestortho is a powerful tool that uses the distance method to calculate orthology based on the phylogenetic criterion of minimum evolution (36). However, considering the vast amount of data input for constructing the database, Mestortho was replaced with PorthoMCL due to Mestortho's limitation on speed and time taken to calculate all minimum evolution of 20 536 genomes. PorthoMCL is designed to calculate orthology groups of a considerable number of genomes with its parallelized sparse file structure using a Markov Cluster algorithm (18), so we expected it to be more suitable for our dataset. PorthoMCL utilizes a similar algorithm to Or-thoMCL (37) but, according to Tabari et al., PorthoMCL was able to process at least a five times larger size of the input genome than OrthoMCL, while the percentage of the process speedup was up to 455%. As the detection of horizontally transferred genes relies on the reconciliation of gene and species phylogenetic trees, clarifying each genome's exact lineage and species information is crucial. Therefore, we used a new approach by employing GTDB-tk to reclassify the taxonomy of all Figure 3. The ratio of total genes and HGT events per phylum and the ratio of HGT events which occurred between and within taxa. (A) The relative frequency of prokaryotic genomes' total genes and HGT-related genes was visualized at the phylum level. More HGT-related genes were detected as the total number of genes in the phylum was larger. (B) The x-axis shows the ratio of the HGT-related genes and the y-axis shows the taxonomy of the animal kingdom, from the genus level to the phylum level. The yellow peaks show the ratio of HGT events between different taxa, and the blue peaks show the ratio of those that occurred within the same taxonomy. Both peaks were denser as the taxonomy levels were higher, meaning more genes were transferred between closely related organisms. the genomes for more reliable detection of HGT-related genes. Furthermore, our method was not adequate for detecting HGT events between the strains of the same species since we used 16S rRNA sequences to build phylogenetic trees. Using 16S rRNA made the species trees less clear as it gets down to the strain level because 16S rRNA is a universal species marker for prokaryotes, possibly inducing false-positive results regarding HGT events between the strains from the same species. Constructing phylogenetic trees based on the whole-genome ANI method could have been an alternative, but the ANI method was considered inadequate to be used due to a decrease in discriminatory power among taxa higher than the family level (38). Taking these into consideration, the GTDB-tk reclassification of each genome was important and ensured that no such falsepositive events were included in the output result. GTDB-tk was also used in the user query processor so the taxonomy information of the query genome could be reclassified.
One of the limitations the previous version had was that it missed other possibly calculable optimal reconciliations during the Ranger-DTL step. This problem was solved in the current update as Ranger-DTL 2.0 was used. In the Op-tRoot step, 3 028 789 pairs of trees (rooted species and unrooted gene trees), constructed by aligning the same number of orthologous gene groups, were used to calculate all optimal gene tree rootings to additionally produce 1 500 857 trees, which results in 4 529 646 pairs of trees altogether. The first version of Ranger-DTL (39) did not allow this type of optimal calculation, so the users of HGTree had to inves-tigate their genomes with only one optimal gene tree per orthology group. The subsequent Ranger-DTL step then calculated all optimal tree-reconciliations for the gene and species trees and returned one reconciliation result that was calculated to have the minimum reconciliation cost. For the second round of the Ranger-DTL step (as mentioned above in the 'Data collection and processing' section), as prior research exemplified, the change in transfer costs from '3' to '4' resulted in the detection of more reliable HGT-related genes, but much fewer in number. Those genes also had higher risks of false negatives (40,41). However, by comparing transfer cost '4' results with cost '3' results during the aggregation step, the chances of false negatives could be reduced (41). This was because the AggregateRanger step aggregated all the Ranger-DTL results (per orthology group) into one result and provided the percentage of genes in each orthology group that showed 100% consistent 'events' and 'mapping' among all optimally calculated Ranger results (23). This percentage was additionally used as our filtering value and, out of all aggregated reconciliation results, those that were 90% consistent were filtered.

Expanded database size
We used 20 536 non-redundant prokaryotic genomes (20 179 bacterial genomes and 357 archaeal genomes, 4964 and 264 species, respectively) to detect 3 028 789 orthologous groups and, by using them, we predicted 6 361 199 HGTrelated genes (6 174   for archaeal genomes) ( Table 1). As shown in Figure 3A, we could identify how many HGT-related genes were predicted per phylum, and it was clear that there was a positive correlation between the ratio of phyla and the ratio of predicted HGT-related genes. Nevertheless, some phyla such as Actinobacteriota and Bacteriodota had a higher ratio of predicted HGT events relative to their total gene ratios. Based on this result, we also calculated the frequency of HGT events among each taxon, whether the events occurred within or between taxa ( Figure 3B). As the figure shows, more genes were transferred within taxa, which also concurs with previous research that HGT occurs between closely related organisms more frequently (42)(43)(44).
In the previous version, only available Pfam IDs and COG (Clusters of Orthologous Genes) clusters were annotated to putative horizontally transferred genes. For a further analysis of bacterial horizontally transferred genes, the current update also focused on other functional annotations such as the KEGG (45)  In addition, the metadata of each genome were also retrieved from the NCBI database, which includes the 'Country', 'Isolation Source' and 'Collection Date'. The metadata are displayed on the website together with the world map (further explained in the following 'Upgraded data visualization' section) and will provide further geographical insights to the users.

Upgraded data visualization
In the current update, we revised the website interface so that users can interpret results more thoroughly and quickly. The 'Browse' section provides all putative HGT events and the genes of 20 536 genomes. It also displays other information such as the genomes' taxonomy, isolation source, geographical location and the ratio of HGT-related genes in the genome (the HGT index). Users can choose a genome in the database to see its HGT-related genes (genes in 'Standard DB' and 'Strict DB', separately) and can also view charts illustrating the ratio of other genomes' genera that share the same transferred genes. Furthermore, users can inspect the external links connecting to the website page of Pfam, COG and KEGG database, virulence factor and antimicrobial resistance information by clicking 'Detail'. Like version 1, the HGT relationship plot related to each genome's HGT events and their corresponding phy-logenetic trees are concisely provided, along with the summary of HGT-related genomes in graphs and the world map. Lastly, with the Jbrowse program (46), gene clusters adjacent to the selected genes can be seen. The Jbrowse graph will give biological insights to the users about clusters of horizontally transferred genes by checking whether the surrounding genes are related to each other since HGT could often occur in the form of a gene cluster (or multiple genes) (47).
Nucleic Acids Research, 2023, Vol. 51, Database issue D1017 Unlike the 'Browse' section, the 'Search' section only shows HGT events in the selected genomes. Within this section, we utilized the program 'Circos' (48) to visualize the predicted HGT events between organisms to make it easier for users to comprehend the events and their related genes. As shown in Figure 5, the visualization can provide an intuitive understanding of the transferring genes by pointing out the start location (in the donor's chromosome) and the input location (in the recipient's chromosome) with different colors matched with selected donor genomes. Users can choose the COG clusters, KEGG pathway clusters, virulence factor and antimicrobial resistance to only illustrate those horizontally transferred genes with corresponding functions to figure out what sort of genes could have affected the evolution of microorganisms.

FUTURE WORKS AND CONCLUSION
In our future research, we intend to concentrate on constructing a database that can additionally accommodate data of other types of microorganisms that can demonstrate HGT, such as viruses, fungi and other eukaryotes. Future works will also involve finding a method to predict HGTrelated genes at the strain level since the major drawback of our approach was that we employed 16S rRNA sequences when constructing the species phylogenetic trees. 16S rRNA gives similar (or even identical) sequences for organisms which are more closely related (49); thus, more research into finding an adequate method to specifically clarify species trees at the strain level is still necessary before obtaining more varied information on horizontally transferred genes.
In summary, the HGTree database was updated with a more extensive number of prokaryotic datasets and predicted a much larger number of HGT-related genes than it could do before. Changes in the prediction procedures allowed more reliable detection of putative HGT-related genes. The website interfaces are now more user-friendly with various types of visualization equipment.

DATA AVAILABILITY
The data are available at http://hgtree2.snu.ac.kr.

FUNDING
This work was supported by the Korea Institute of Planning and Evaluation for Technology in Food, Agriculture and Forestry (IPET) through the High Value-added Food Technology Development Program, funded by the Ministry of Agriculture, Food and Rural Affairs (MAFRA) . This work was also supported by the BK21 FOUR Program of Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea. Conflict of interest statement. None declared.